Execution Primitives for Scalable Joins and Aggregations in Map Reduce
Abstract
Analytics on Big Data is critical to derive business insights and drive innovation in today’s Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex joins and aggregations, e.g. statistical calculations, scale poorly on these systems. In this paper we propose novel primitives for scaling such calculations. We propose a new data model for organizing datasets into calculation data units that are organized based on user-defined cost functions. We propose new operators that take advantage of these organized data units to significantly speed up joins and aggregations. Finally, we propose strategies for dividing the aggregation load uniformly across worker processes that are very effective in avoiding skews and reducing (or in some cases even removing) the associated overheads. We have implemented all our proposed primitives in a framework called Rubix, which has been in production at LinkedIn for nearly a year. Rubix powers several applications and processes TBs of data each day. We have seen remarkable improvements in speed and cost of complex calculations due to these primitives.
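The abstract does not spell out how the aggregation load is divided, so the following is only an illustrative sketch of the general idea of cost-aware load balancing, not Rubix's actual method. It assumes a hypothetical user-defined cost function `cost_fn` (here, one unit per row) and uses a standard greedy "heaviest key first, least-loaded worker next" heuristic; the names `key_costs_from_rows`, `assign_keys_to_workers`, and `worker_loads` are invented for this example.

```python
import heapq
from collections import defaultdict


def key_costs_from_rows(rows, cost_fn):
    """Sum a user-defined cost over all rows that share a group key."""
    costs = defaultdict(int)
    for key, value in rows:
        costs[key] += cost_fn(key, value)
    return dict(costs)


def assign_keys_to_workers(key_costs, num_workers):
    """Greedy balancing: take keys from most to least expensive and give
    each one to the worker with the smallest load so far."""
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(key_costs.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[key] = worker
        heapq.heappush(heap, (load + cost, worker))
    return assignment


def worker_loads(key_costs, assignment, num_workers):
    """Total cost placed on each worker under the given assignment."""
    loads = [0] * num_workers
    for key, worker in assignment.items():
        loads[worker] += key_costs[key]
    return loads


# Hypothetical skewed input: key "a" has far more rows than the others.
rows = [("a", 1)] * 8 + [("b", 1)] * 4 + [("c", 1)] * 3 + [("d", 1)] * 1
costs = key_costs_from_rows(rows, cost_fn=lambda key, value: 1)
assignment = assign_keys_to_workers(costs, num_workers=2)
print(worker_loads(costs, assignment, num_workers=2))
```

Plain hash partitioning ignores how heavy each key is, so a single hot key can overload one worker. A cost-aware assignment like this one spreads the estimated work more evenly, at the price of a pre-pass to estimate per-key cost.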
Similar resources
Cascading map-side joins over HBase for scalable join processing
One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable index...
Optimization of Complex SPARQL Analytical Queries
Analytical queries are crucial for many emerging Semantic Web applications such as clinical-trial recruiting in Life Sciences that incorporate patient and drug profile data. Such queries compare aggregates over multiple groupings of data which pose challenges in expression and optimization of complex grouping-aggregation constraints. While these challenges have been addressed in relational mode...
Multi-Join Query Optimization for Read-Optimized Data Warehouse in a Cloud Environment
Read-Optimized databases are well suited for read intensive Data Warehouse applications. In addition, data in these applications grow rapidly and hence need a dynamically scalable environment like Cloud. Cloud provides a flexible environment where user can load data, execute queries and scale resources on demand. As the resources are scaled up, the number of nodes involved in the execution of q...
The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database
Real-time analytics on massive datasets has become a very common need in many enterprises. These applications require not only rapid data ingest, but also quick answers to analytical queries operating on the latest data. MemSQL is a distributed SQL database designed to exploit memory-optimized, scale-out architecture to enable real-time transactional and analytical workloads which are fast, hig...
RDFChain: Chain Centric Storage for Scalable Join Processing of RDF Graphs using MapReduce and HBase
As a massive linked open data is available in RDF, the scalable storage and efficient retrieval using MapReduce have been actively studied. Most of previous researches focus on reducing the number of MapReduce jobs for processing join operations in SPARQL queries. However, the cost of shuffle phase still occurs due to their reduce-side joins. In this paper, we propose RDFChain which supports th...
Journal: PVLDB
Volume: 7
Publication date: 2014